Hanseung Lee, Georgia Institute of Technology, hanseung.lee@gatech.edu [PRIMARY contact]
Jaegul Choo, Georgia Institute of Technology, joyfull@cc.gatech.edu
Carsten Gorg, Georgia Institute of Technology, goerg@cc.gatech.edu
Jaeyeon Kihm, Georgia Institute of Technology, jkihm3@gatech.edu
Zhicheng Liu, Georgia Institute of Technology, zliu6@gatech.edu
Jaeeun Shim, Georgia Institute of Technology, jaeeun.shim@gatech.edu
Haesun Park, Georgia Institute of Technology, hpark@cc.gatech.edu
John Stasko, Georgia Institute of Technology, stasko@cc.gatech.edu
We developed a system, GeneTracer, to visualize data related to genetic sequences. It has three views: the Gene Sequence view, the Disease Characteristic view, and the Graph view.
The Gene Sequence view visualizes the current outbreak sequences and the native sequences.
- Colors of each gene base: A (red), T (green), C (purple), G (blue)
- Heatmap row vector (first row): represents how much the gene bases at a position differ across the categories of a characteristic
- Heatmap column vector (first column): represents how different each sequence is from the selected row
- Interactions: removing or moving columns and rows with mouse click/drag; multi-selection with the Ctrl key
The Disease Characteristic view visualizes each sequence's characteristics using colors. The darker the color, the more severe the value.
- Color coding: Symptoms (red), Mortality (blue), Complications (green), Resistance (purple), Vulnerability (orange)
- Interactions: sorting by each characteristic, index, or total weight; multi-selection for reordering or removing sequences in the Gene Sequence view
The Gene Sequence view and the Disease Characteristic view are linked.
- A sequence selected in one view is also selected in the other.
- Pressing the "Sync" or "Sync always" button in one view reorders the sequences in the other view to match.
The Graph view visualizes the relations among the sequences and also shows the minimum spanning tree (MST).
- Nodes: indices of the sequences and countries of the native sequences
- Edge weights: Hamming distance
The Graph view can also interact with the other views. It was implemented with the open-source JUNG, the Java Universal Network/Graph Framework.
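The following is a minimal Python sketch of the idea behind the Graph view, not GeneTracer's actual code (which is built on JUNG in Java). It assumes a complete graph over the sequences with Hamming distances as edge weights and builds the MST with Prim's algorithm. The function names and the toy sequences are hypothetical and are not taken from the challenge data.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_tree(seqs):
    """Prim's algorithm on the complete graph weighted by Hamming distance.

    seqs maps a node label to its sequence string.
    Returns a list of (weight, u, v) edges.
    """
    labels = list(seqs)
    if not labels:
        return []
    in_tree = {labels[0]}
    edges = []
    while len(in_tree) < len(labels):
        # Cheapest edge that connects the tree to a node outside it.
        best = min(
            (hamming(seqs[u], seqs[v]), u, v)
            for u in in_tree
            for v in labels
            if v not in in_tree
        )
        edges.append(best)
        in_tree.add(best[2])
    return edges


# Toy data (hypothetical, not the challenge sequences).
toy = {
    "native_A": "ATCGATCG",
    "native_B": "ATCGATGG",
    "outbreak_1": "ATCGTTGG",
    "outbreak_2": "ATCCTTGG",
}
mst = minimum_spanning_tree(toy)
for weight, u, v in mst:
    print(u, "-", v, "distance", weight)
```

In this toy example the outbreak sequences attach to the tree through native_B, which is the kind of adjacency we looked for in the real graph.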
Video:
ANSWERS:
MC3.1:
What is the region or country of origin for the current outbreak? Please provide your answer as the name of the native viral strain along with a brief explanation.
Nigeria_B is the origin of the current outbreak. GeneTracer first removes the gene bases that are identical across all sequences, which leaves a much more manageable number of bases. Next, GeneTracer constructs the graph and calculates the minimum spanning tree (MST) in the Graph view. In Figure 2, Nigeria_B is the native sequence nearest to the current outbreak sequences, which makes it a candidate. In the Gene Sequence view (Figure 1), we interactively reordered rows and columns, dragging them upward next to the outbreak sequences to make comparison easier. We also used a filter operation to remove sequences that were clearly dissimilar. Exploring the data interactively in this manner, we found that Nigeria_B was by far the most similar sequence to the current outbreak, which matched the result from the Graph view.
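The two operations described above, dropping positions that are identical in every sequence and finding the native strain closest to the outbreak, can be sketched in Python as follows. This is an illustration under assumptions: "closest" is taken here to mean the smallest summed Hamming distance to all outbreak sequences, which is our simplification and not necessarily how the views rank candidates. All names and sequences are hypothetical.

```python
def variable_columns(seqs):
    """0-based indices of positions where at least two sequences differ."""
    rows = list(seqs.values())
    return [i for i in range(len(rows[0])) if len({r[i] for r in rows}) > 1]


def nearest_native(natives, outbreaks):
    """Label of the native sequence with the smallest summed Hamming
    distance to all outbreak sequences (an assumed closeness measure)."""
    def total_distance(native_seq):
        return sum(
            sum(a != b for a, b in zip(native_seq, outbreak_seq))
            for outbreak_seq in outbreaks.values()
        )
    return min(natives, key=lambda name: total_distance(natives[name]))


# Toy data (hypothetical, not the challenge sequences).
natives = {"native_X": "AACGT", "native_Y": "ATCGA"}
outbreaks = {"o1": "ATCGT", "o2": "ATCCA"}

kept = variable_columns({**natives, **outbreaks})
closest = nearest_native(natives, outbreaks)
print("columns kept:", kept)
print("closest native:", closest)
```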
MC3.2:
Over time, the virus spreads and the diversity of the virus increases as it mutates. Two patients infected with the Drafa virus are in the same hospital as Nicolai. Nicolai has a strain identified by sequence 583. One patient has a strain identified by sequence 123 and the other has a strain identified by sequence 51. Assume only a single viral strain is in each patient. Which patient likely contracted the illness from Nicolai and why? Please provide your answer as the sequence number along with a brief explanation.
To solve the second problem, we only need to examine the strains identified by sequences 583, 123, and 51. We gathered these three sequences at the top of the Gene Sequence view using mouse drag-and-drop interactions. From the heatmap column vector, we found that 583 is more similar (lighter color) to 123 than to 51. We also filtered out the columns in which all three sequences have the same gene base. As a result, only four columns remained, as shown in Figure 3. From this analysis, we found that sequence 123 differs from sequence 583 in only one gene base (column index 269), whereas sequence 51 differs in three (column indices 494, 842, and 946). Therefore, we conclude that the patient with the strain identified by sequence 123 likely contracted the illness from Nicolai (sequence 583).
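The comparison above amounts to listing the base substitutions of each candidate relative to a reference sequence and preferring the candidate with fewer of them. A minimal Python sketch follows, assuming 1-based positions counted left to right; the helper name and the toy sequences are hypothetical and much shorter than the real ones.

```python
def substitutions(reference, other):
    """List (position, reference_base, other_base) for every position at
    which `other` differs from `reference`. Positions are 1-based."""
    return [
        (i + 1, r, o)
        for i, (r, o) in enumerate(zip(reference, other))
        if r != o
    ]


# Toy data (hypothetical, not sequences 583, 123, and 51).
reference = "ACGTAC"
candidates = {"cand_a": "ACGTAG", "cand_b": "TCGAAG"}

diffs = {name: substitutions(reference, seq) for name, seq in candidates.items()}
closest = min(diffs, key=lambda name: len(diffs[name]))
for name, subs in diffs.items():
    print(name, subs)
print("closest to reference:", closest)
```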
MC3.3:
Signs and symptoms of the Drafa virus are varied and humans react differently to infection. Some mutant strains from the current outbreak have been reported as being worse than others for the patients that come in contact with them.
Identify the top 3 mutations that lead to an increase in symptom severity (a disease characteristic). The mutations involve one or more base substitutions. For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.
For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,
C → G, 456 (C changed to G at position 456)
G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)
A → G, 39 (A changed to G at position 39)
Answers (Figures 4 and 5):
1) A → C, 269
2) A → T, 946 and T → C, 842
3) A → G, 223
Mutation A → C, 269 occurred only in sequences with severe symptoms (sequences 99, 118, 123, and 997), so it is strongly related to symptom severity.
Mutations A → T, 946 and T → C, 842 occurred mostly in sequences with severe and moderate symptoms. Note that when T → C, 842 occurs without A → T, 946, the sequence has mild symptoms (e.g., sequences 49 and 961). Therefore, the two mutations together increase symptom severity.
For the third mutation, we obtained three candidates: A → G, 223; A → C, 197; and G → C, 212. We selected A → G, 223 because the sequences changed by the other two candidates overlap with those of the two mutations we found first.
MC3.4:
Due to the rapid spread of the virus and limited resources, medical personnel would like to focus on treatments and quarantine procedures for the worst of the mutant strains from the current outbreak, not just symptoms as in the previous question. To find the most dangerous viral mutants, experts are monitoring multiple disease characteristics.
Consider each virulence and drug resistance characteristic as equally important. Identify the top 3 mutations that lead to the most dangerous viral strains. The mutations involve one or more base substitutions. In a worst case scenario, a very dangerous strain could cause severe symptoms, have high mortality, cause major complications, exhibit resistance to anti viral drugs, and target high risk groups. For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.
For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,
C → G, 456 (C changed to G at position 456)
G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)
A → G, 39 (A changed to G at position 39).
Answers:
1) A → T, 946 and T → C, 842
2) T → C, 790
3) A → G, 223
Unlike the previous problem, here we had to consider all the other characteristics as well as symptoms. Our approach consists of three steps.
For the first step, we reordered the Gene Sequence view according to each sorted result from the Disease Characteristic view, using both views together. We sorted the sequences by a characteristic and then applied this order to the Gene Sequence view using the "Sync" button. In this process, our tool marks the boundary between different levels of a characteristic with a thick blue line in the Gene Sequence view. Then, we reorganized the Gene Sequence view by moving the columns with large variance to the left. We also changed the order of the sequences (rows) to place similar gene bases together within the same category of a characteristic. Through these interactive steps, we could see patterns in the gene sequences and find potential mutations that are critical for each characteristic. As a result, we found two to five candidate critical mutations per characteristic. Some examples are shown in Figures 7 and 8.
The second step repeats the process of the first step, except that the sequences are sorted by total weight. We assigned a weight of one to three to each characteristic value, and the total weight aggregates the weights of all characteristics. We reorganized the Gene Sequence view based on the total weight. Then we focused on the critical mutation candidates from the first step and analyzed them again by moving, removing, or filtering columns and rows.
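As an illustration of the total weight, the Python sketch below assumes that "aggregating" means a plain sum of the five per-characteristic weights, each between 1 and 3, so totals range from 5 to 15. The characteristic keys, the function name, and the toy records are hypothetical.

```python
CHARACTERISTICS = (
    "symptoms", "mortality", "complications", "resistance", "vulnerability",
)


def total_weight(levels):
    """Sum of the per-characteristic weights (each 1, 2, or 3) of one sequence.
    Summation is an assumed aggregation rule."""
    for name in CHARACTERISTICS:
        if levels[name] not in (1, 2, 3):
            raise ValueError(f"weight for {name} must be 1, 2, or 3")
    return sum(levels[name] for name in CHARACTERISTICS)


# Toy records (hypothetical, not the challenge data).
records = {
    "s1": dict(symptoms=3, mortality=3, complications=3, resistance=3, vulnerability=3),
    "s2": dict(symptoms=1, mortality=1, complications=1, resistance=1, vulnerability=1),
    "s3": dict(symptoms=3, mortality=2, complications=2, resistance=3, vulnerability=1),
}

totals = {name: total_weight(levels) for name, levels in records.items()}
# Most dangerous first, mirroring a descending sort by total weight.
order = sorted(totals, key=totals.get, reverse=True)
print(totals)
print(order)
```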
Finally, the third step speeds up finding and verifying our answer. We used several data mining techniques to identify which gene positions carry significant information. Regression and decision trees gave helpful clues to the answers. From the results, we chose the column indices with the highest coefficient values and also examined a few nodes near the root of the decision tree. We assigned a value to each gene base and treated it as an unordered categorical variable. In this step, we also assigned 1 for mild symptoms, 2 for moderate symptoms, and 3 for severe symptoms. Therefore, for 58 gene sequences, each with 1404 gene bases, we created a 58-by-1404 matrix X of predictor values along with a vector Y consisting of 58 response values. Since a decision tree can combine gene bases that behave similarly with respect to the level of a target value, little information is lost by collapsing gene bases together, which improves the classification result. Linear regression was also used to explore initial column indices (gene base positions) that could be potential answers. We formed a matrix of tuples consisting of pairs of gene sequences and the positions where base substitutions occurred. The data are pairs of indices, and we looked for the pair with the highest influence on the total weight of the characteristics. We did this with first-order linear regression, second-order linear regression, and the two combined. We selected the top few coefficients with their corresponding column indices and started the visual analysis with our tool from there. Although these techniques did not give the answers we expected, starting from some of the suggested column indices made our visualization tool more effective. The quantitative results also let us verify that the answers from our qualitative decision-making process are correct.
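To give a sense of why informative positions end up near the root of a decision tree, the Python sketch below ranks positions by information gain, treating each base as an unordered category. This is an ID3-style criterion chosen for illustration; the split criterion of the tool we actually used may differ, and the function names and toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(column, labels):
    """Entropy reduction obtained by splitting `labels` on the base in `column`."""
    n = len(labels)
    groups = {}
    for base, y in zip(column, labels):
        groups.setdefault(base, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_positions(sequences, labels):
    """0-based positions sorted by decreasing information gain
    (ties broken by position)."""
    gains = [
        (information_gain([s[i] for s in sequences], labels), i)
        for i in range(len(sequences[0]))
    ]
    return [i for _, i in sorted(gains, key=lambda t: (-t[0], t[1]))]


# Toy data (hypothetical): four short sequences with severity labels
# 1 (mild) and 3 (severe), following the 1/2/3 coding described above.
sequences = ["AAC", "AAG", "ATC", "ATG"]
labels = [1, 1, 3, 3]
ranking = rank_positions(sequences, labels)
print(ranking)
```

Here the middle position separates mild from severe perfectly, so it ranks first and would be the root split of a tree grown with this criterion.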
From the first step, we obtained the following candidate mutations.
Symptoms: A → T at 946 and T → C at 842 / A → C at 269 / A → G at 223
Mortality: A → T at 946 and T → C at 842 / T → C at 790 / A → C at 269
Complications: T → C at 790 / A → G at 223
Resistance: T → C at 790 / A → T at 946 and T → C at 842 / A → C at 269 / A → G at 223 / A → C at 197
Vulnerability: A → T at 946 and T → C at 842 / G → C at 212
In the second step, we analyzed the Gene Sequence view as shown in Figure 6. Combining it with the candidates from the first step, we determined the top three mutations across all characteristics. In Figure 6, look at the rows that have both a green square at position 946 and a purple square at position 842. These rows are mostly clustered in the upper part of the view, which means they have large total weights. The purple squares at position 790 are also mostly in the upper part of the view. Finally, consider the blue squares at position 223. Every sequence with a mutation at position 223 has a total weight of at least 11, which means this mutation is very dangerous. In conclusion, the visualization makes it easy to check and verify the answers. Gathering the same colors together makes the patterns easier for the user to understand.
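The check described above, that every sequence carrying a given base at a given position has a high total weight, can also be expressed numerically. The Python sketch below assumes 1-based positions; the function name, sequences, and weights are hypothetical toy values.

```python
def weights_with_base(sequences, weights, position, base):
    """Total weights of the sequences that carry `base` at the 1-based `position`."""
    return [
        weights[name]
        for name, seq in sequences.items()
        if seq[position - 1] == base
    ]


# Toy data (hypothetical, not the challenge sequences or weights).
sequences = {"s1": "ACG", "s2": "AGG", "s3": "AGT", "s4": "ACT"}
weights = {"s1": 6, "s2": 12, "s3": 11, "s4": 5}

carriers = weights_with_base(sequences, weights, position=2, base="G")
print("weights of carriers:", carriers)
print("lowest total weight among carriers:", min(carriers))
```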
In addition, we noticed some other interesting patterns. For example, if a G → C mutation occurs at position 22, the total weight is quite low. This suggests that the mutation leads to a more stable strain, which might point toward a potential cure for the current outbreak.
In the third step, we verified the answer using results from regression and decision trees. For example, Figure 9 shows a decision tree output in which some nodes near the root correspond to column indices from the answer. This means that, at these column indices, mutated and non-mutated gene bases largely discriminate between the categories of the characteristics. This supports our answer.
Figures:
Figure 1: fig1_MC3_1.jpg
Figure 2: fig2_MST.jpg
Figure 3: fig3_MC3_2.jpg
Figure 4: fig4_MC3_3.jpg
Figure 5: fig5_MC3_3.jpg
Figure 6: fig6_MC3_4.jpg
Figure 7: fig7_MC3_4.jpg
Figure 8: fig8_MC3_4.jpg
Figure 9: fig9_decision_tree.jpg